Add python bindings #98
LGTM, would add one basic encode / decode test
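The basic encode / decode test suggested here could take roughly the following shape. This is only a sketch: the signatures of the bound classes are not shown in this thread, so `ToyByteTokenizer` is a hypothetical stand-in that maps text to UTF-8 byte values, and `encode` / `decode` are assumed method names. A real test would replace the stand-in with one of the bound tokenizers loaded from an actual model file, and an exact round trip may not hold for every tokenizer.

```python
class ToyByteTokenizer:
    """Hypothetical stand-in for a bound tokenizer; not part of the PR."""

    def encode(self, text: str) -> list[int]:
        # One token id per UTF-8 byte.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # Reassemble the bytes and decode them back to text.
        return bytes(ids).decode("utf-8")


def check_round_trip(tokenizer, text: str) -> list[int]:
    """Encode `text`, decode the ids, and assert the original text comes back."""
    ids = tokenizer.encode(text)
    assert all(isinstance(i, int) for i in ids), "token ids should be integers"
    assert tokenizer.decode(ids) == text, "decode(encode(text)) should return text"
    return ids


def test_encode_decode_round_trip():
    tokenizer = ToyByteTokenizer()
    # Cover plain ASCII, the empty string, and multi-byte characters.
    for text in ["hello world", "", "naïve café"]:
        check_round_trip(tokenizer, text)
```

Keeping the assertion logic in `check_round_trip` means the same check can be reused across several tokenizer classes once their real constructors and loading calls are known.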
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this in D78053854.
Summary:
This pull request introduces Python bindings for the PyTorch Tokenizers library. It includes changes to support Python bindings in the build system, integration of `pybind11`, and updates to the Python package for distribution. Additionally, it modifies the tokenizer classes and adds testing configurations for the new bindings.

### Python Bindings Integration:
* **Added Python bindings option in `CMakeLists.txt`**: Introduced the `TOKENIZERS_BUILD_PYTHON` option and the logic to build Python bindings using `pybind11`. This includes creating the `pytorch_tokenizers_cpp` extension module and linking it with the tokenizers library. [[1]](diffhunk://#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR21) [[2]](diffhunk://#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR125-R156)
* **New `src/python_bindings.cpp` file**: Implemented Python bindings for tokenizers using `pybind11`. This includes binding classes like `Tokenizer`, `HFTokenizer`, `Tiktoken`, `Llama2cTokenizer`, and `SPTokenizer`.

### Python Package Updates:
* **Updated `setup.py` for Python bindings**: Added support for building the Python extension module using CMake and `pybind11`. This includes defining a custom `CMakeBuild` class for handling the build process.
* **Modified `pytorch_tokenizers/__init__.py`**: Updated the package to include the new C++ tokenizer bindings and removed older Python implementations. Added error handling for failed imports.

### Testing Enhancements:
* **Added `pytest.ini` configuration**: Configured Pytest for the project, including test discovery rules, ignored directories, and markers for different test types.
* **Defined Python tests in targets.bzl**: Introduced a `targets.bzl` target for testing the Python bindings (`test_python_bindings.py`).

### Tokenizer Class Changes:
* **Added constructors to `Tiktoken` class**: Introduced new constructors to let pybind11 bind init() to constructors (it doesn't support `std::unique_ptr<std::vector<std::string>>`).

### Build System Changes:
* **Added Bazel target for Python bindings**: Defined a `targets.bzl` target for building the Python bindings, including dependencies on tokenizer modules and `pybind11`.

Differential Revision: D78053854

Pulled By: larryliu0820
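The summary mentions error handling for failed imports in `pytorch_tokenizers/__init__.py` without showing it. One way such a guard might look is sketched below. The helper name `load_cpp_bindings` and the wording of the message are assumptions made for illustration; only the module name `pytorch_tokenizers_cpp` and the `TOKENIZERS_BUILD_PYTHON` option come from the summary, and the real import path inside the package may differ.

```python
import importlib
from types import ModuleType


def load_cpp_bindings(module_name: str = "pytorch_tokenizers_cpp") -> ModuleType:
    """Import the compiled extension, re-raising with a more actionable message.

    Hypothetical helper: the PR's actual error handling is not shown in the thread.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            f"Could not import the C++ tokenizer bindings ({module_name!r}). "
            "The extension may not have been built; the CMake option "
            "TOKENIZERS_BUILD_PYTHON controls whether it is compiled."
        ) from exc
```

Re-raising with `from exc` keeps the original traceback attached, which helps distinguish an extension that was never built from one that fails to load because of a missing shared library.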
0166966 to 23c8992
This pull request was exported from Phabricator. Differential Revision: D78053854
Reviewed By: jackzhxng
23c8992 to c9e9194
c9e9194 to 39ed03a
39ed03a to c376a3e
c376a3e to d5ea9b9